We extracted the experimentally measured enthalpy of solvation in ethanol values for 144 compounds from the Acree Enthaply of Solvation Dataset [1]. Ferrocene was not considered as it is a complex.
Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.6) to calculated the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens but did not include WHIM, CPSA, or any geometrical descriptors.
[include file with descriptors here]
Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns with all "NA."
setwd("C://..")
mydata = read.csv(file="SolventInformationWithDescriptors.csv",head=TRUE,row.names="Key")
ncol(mydata) #get number of columns
[output]
221
[output]
library(caret) ##for feature selection
mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA"
ncol(mydata)
[output]
221
[output]
write.csv(mydata,file="DescriptorsNoNA.csv")
nzv <-nearZeroVar(mydata) ##remove zeros and other small variance columns
mydata <- mydata[, -nzv]
ncol(mydata)
[output]
119
[output]
write.csv(mydata,file="DescriptorsNoNZV.csv")
After the above code was executed, two columns corresponding to the two descriptors HybRatio and Kier3 wre deleted due to multiple NA values. Also, the following compounds (rows) were removed due to multiple NA values:
The data that we will use to create our RF models now consists of 138 ethanol enthalpy of solvation values with 86 descriptors (MW was added back to the data for informational purposes), see the histograms below: Histogram of Enthalpy of Solvation Data After Curation and Feature Selection Histogram of Enthalpy of Solvation Data After Curation and Feature Selection by Molecular Weight
The following image depicts the chemical space that will be inputted into R to create the random forest model. We used 2 principal components that explain 42% of the variation and clustered the data into three groups upon the recommendation from the pam function in R. Chemical Space - Clustered Using R
Modeling
##Modeling
library("randomForest")
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
mydata.rf <- randomForest(DeltaHsolv ~ ., data = train,importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = DeltaHsolv ~ ., data = train, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 28
Mean of squared residuals: 95.31999
% Var explained: 82.69
[output]
saveRDS(mydata.rf, file = "EnthalpyTrainingSet")
training.predict <- predict(mydata.rf,train)
write.csv(training.predict, file = "RFTrainingSetPredict.csv")
test.predict <- predict(mydata.rf,test)
write.csv(test.predict, file = "RFTestSetPredict.csv")
Predicted vs Measured DeltaHsolv for the test-set
The test-set statistics were: AAE 4.76, MSE 39.7, R2 82.9% (We did not include azelaic acid in our calculations).
Create RF model using all data.
library("randomForest")
mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key")
mydata.rf <- randomForest(DeltaHsolv ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = DeltaHsolv ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 28
Mean of squared residuals: 65.72521
% Var explained: 87.47
[output]
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
saveRDS(mydata.rf, file = "EnthalpyOfSolvation")
training.predict <- predict(mydata.rf,mydata)
write.csv(training.predict, file = "AllDataRFPredict.csv")
The most important descriptors (nHBDon, TopoPSA, khs.sOH, AMR) can be seen on the following plot: Variable Importance Plot Results We can see... PCA Colored by AE
Modeling Enthalpy of Solvation in Ethanol
Researchers: Robert G Sisk and Andrew SID LangData
We extracted the experimentally measured enthalpy of solvation in ethanol values for 144 compounds from the Acree Enthaply of Solvation Dataset [1]. Ferrocene was not considered as it is a complex.Descriptors
We used Rajarshi Guha's CDK Descriptor Calculator (v 1.4.6) to calculated the CDK descriptors for all the compounds in our data file. We selected the option to add explicit hydrogens but did not include WHIM, CPSA, or any geometrical descriptors.[include file with descriptors here]
Once descriptors were calculated, we deleted all columns that had a zero standard deviation, including columns with all "NA."
setwd("C://..") mydata = read.csv(file="SolventInformationWithDescriptors.csv",head=TRUE,row.names="Key") ncol(mydata) #get number of columns [output] 221 [output] library(caret) ##for feature selection mydata <- mydata[,colSums(is.na(mydata))<nrow(mydata)] ##remove columns of all entries "NA" ncol(mydata) [output] 221 [output] write.csv(mydata,file="DescriptorsNoNA.csv") nzv <-nearZeroVar(mydata) ##remove zeros and other small variance columns mydata <- mydata[, -nzv] ncol(mydata) [output] 119 [output] write.csv(mydata,file="DescriptorsNoNZV.csv")After the above code was executed, two columns corresponding to the two descriptors HybRatio and Kier3 wre deleted due to multiple NA values. Also, the following compounds (rows) were removed due to multiple NA values:
Compounds Removed
Helium
Neon
Argon
Xenon
Radon
Hydrogen
Feature Selection
Feature selection was performed using the following R code that removed highly correlated descriptors:Description of Data
The data that we will use to create our RF models now consists of 138 ethanol enthalpy of solvation values with 86 descriptors (MW was added back to the data for informational purposes), see the histograms below:Histogram of Enthalpy of Solvation Data After Curation and Feature Selection
Histogram of Enthalpy of Solvation Data After Curation and Feature Selection by Molecular Weight
Principal Component & Cluster Analysis
Cluster Analysis## determining the best number of clusters library(fpc) library(cluster) mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") asw <- numeric(20) for (k in 2:20) asw[[k]] <- pam(mydata, k) $ silinfo $ avg.width asw [output] [1] 0.0000000 0.4144260 0.4179739 0.3358037 0.3437506 0.3564146 0.3714603 0.3462984 0.3108418 0.3074072 0.3351801 0.3365456 0.3388145 0.3351526 0.3109977 0.3156479 0.3159808 [18] 0.3116679 0.3293435 0.3328914 [output] k.best <- which.max(asw) cat("silhouette-optimal number of clusters:", k.best, "\n") #3 ##Cluster Analysis mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") mydata <- scale(mydata) # standardize variables # K-Means Cluster Analysis fit <- kmeans(mydata, 3) #3 cluster solution # get cluster means aggregate(mydata,by=list(fit$cluster),FUN=mean) [output] Group.1 DeltaHsolv ALogP ALogp2 AMR BCUTw.1l BCUTw.1h 1 1 0.2704986 0.2374628 -0.1573636 -0.2879611 0.05055558 0.04366062 2 2 -1.1648089 -1.3120591 0.7743171 0.9234572 -0.19608056 -0.31448339 3 3 -0.7562661 0.2046194 0.1499057 1.7547322 -0.20620319 0.25735471 [output] # append cluster assignment mydata <- data.frame(mydata, fit$cluster) write.csv(mydata, file = "clusters.csv")Principal Component Analysis
##PCA mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") pc1 <- prcomp(mydata, scale. = T) x <- pc1$x summary(pc1) [output] Importance of components%s: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 Standard deviation 4.9562 3.4753 2.9841 2.7209 2.34156 2.04465 1.81481 1.63781 1.54550 Proportion of Variance 0.2823 0.1388 0.1023 0.0851 0.06302 0.04805 0.03786 0.03083 0.02745 Cumulative Proportion 0.2823 0.4212 0.5235 0.6086 0.67163 0.71969 0.75754 0.78838 0.81583 [output] write.csv(x, file = "PCA.csv")The following image depicts the chemical space that will be inputted into R to create the random forest model. We used 2 principal components that explain 42% of the variation and clustered the data into three groups upon the recommendation from the pam function in R.
Chemical Space - Clustered Using R
Modeling
##Modeling library("randomForest") mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") ## 75% of the sample size smp_size <- floor(0.75 * nrow(mydata)) ## set the seed to make your partition reproductible set.seed(123) train_ind <- sample(seq_len(nrow(mydata)), size = smp_size) train <- mydata[train_ind, ] test <- mydata[-train_ind, ] mydata.rf <- randomForest(DeltaHsolv ~ ., data = train,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = DeltaHsolv ~ ., data = train, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28 Mean of squared residuals: 95.31999 % Var explained: 82.69 [output] saveRDS(mydata.rf, file = "EnthalpyTrainingSet") training.predict <- predict(mydata.rf,train) write.csv(training.predict, file = "RFTrainingSetPredict.csv") test.predict <- predict(mydata.rf,test) write.csv(test.predict, file = "RFTestSetPredict.csv")Predicted vs Measured DeltaHsolv for the test-set
The test-set statistics were: AAE 4.76, MSE 39.7, R2 82.9% (We did not include azelaic acid in our calculations).
Create RF model using all data.
library("randomForest") mydata = read.csv(file="SolventInformationFeatureSelected.csv",head=TRUE,row.names="Key") mydata.rf <- randomForest(DeltaHsolv ~ ., data = mydata,importance = TRUE) print(mydata.rf) [output] Call: randomForest(formula = DeltaHsolv ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 28 Mean of squared residuals: 65.72521 % Var explained: 87.47 [output] varImpPlot(mydata.rf,main="Random Forest Variable Importance") saveRDS(mydata.rf, file = "EnthalpyOfSolvation") training.predict <- predict(mydata.rf,mydata) write.csv(training.predict, file = "AllDataRFPredict.csv")The most important descriptors (nHBDon, TopoPSA, khs.sOH, AMR) can be seen on the following plot:
Variable Importance Plot
Results
We can see...
PCA Colored by AE
References
1. Wiiliam E. Acree and Andrew SID Lang. Acree Enthalpy of Solvation Dataset. figshare. (2015). https://doi.org/10.6084/m9.figshare.1572326.v1